ABSTRACT 

This study investigates the Test of English as a 
Foreign Language (TOEFL), in particular the relative contributions to 
score dependability (analogous to classical theory reliability) of 
various numbers of items and subtests as well as the decision 
dependability at different cut points. Research questions that apply 
to the overall TOEFL battery and to its tests and subtests address 
classical theory reliability estimates; relative contributions to 
error variance of persons, items, subtests, and their interactions; 
dependability for varying numbers of items and subtests; and the 
effect on score dependability of various cutpoints. The study was 
based on the item responses of 20,000 test takers from 15 different 
language backgrounds. Data were collected from the May 1991 
administration of the TOEFL at foreign and domestic test centers. 
Analyses included descriptive statistics, classical theory 
reliability estimates, generalizability theory, and decision 
dependability estimates for various cut points. Test dependability 
analyses indicate that subtests can make substantial contributions to 
the variance of test scores and thus may affect dependability in 
important ways. However, these results also make it clear that, in 
some cases, subtests may have a negligible impact on dependability. 
Thus, while inclusion of subtests or the expansion of the number of 
subtests on a test may have a substantial beneficial effect on the 
dependability of the scores on that test, this relationship cannot be 
taken as a foregone conclusion. Findings also indicate that on the 
present TOEFL the lowest dependabilities along the range are still 
very high. Analyses are appended. (Contains 20 references.) (AA) 
DECISION DEPENDABILITY OF SUBTESTS, TESTS, AND THE OVERALL TOEFL 
TEST BATTERY 

James Dean Brown 
University of Hawaii at Manoa 

Jacqueline A. Ross 
Educational Testing Service 



ABSTRACT 

The reliability of the TOEFL battery total scores and those 
for each of the tests involved have repeatedly been shown to be 
high. In addition, the standard error of measurement has long 
been used as a means for estimating the average unreliable 
variance across all scores. The purpose of this large-scale 
study was to examine the reliability and dependability of the 
TOEFL test battery in a number of new ways. In the process, we 
wanted to investigate the relative contributions to score 
dependability (which is analogous to classical theory 
reliability) of various numbers of items and subtests as well as 
the decision dependability at different cut points. To achieve 
the above goals, four research questions were formulated. These 
research questions apply not only to the overall TOEFL battery, 
but also to the various tests and subtests that it includes: 

1. What are the classical theory reliability estimates? 

2. What are the relative contributions to error variance of 
persons, items, subtests, and their interactions? 

3. What is the dependability for varying numbers of items and 
subtests? 

4. What is the effect on score dependability of various cut 
points? 

The study was based on the item responses of 20,000 test 
takers from 15 different language backgrounds. The data were 
collected from the May 1991 administration of the TOEFL at 
foreign and domestic test centers. The first test in the TOEFL 
battery is a listening test including three item types: statement 
items, dialogue-based items, and minitalk items. The second test 
covers two item types: structure and written expression. The 
third test consists of vocabulary and reading comprehension 
items. 

The analyses included descriptive statistics, classical 
theory reliability estimates, generalizability theory, and 
decision dependability [phi(lambda)] estimates for various cut 
points. The implications are discussed in terms of the 
dependability of using various combinations of TOEFL total, test, 
and subtest scores, as well as the dependability of decisions 
made at various cut points. Such issues are important because 
high decision dependability is a precondition for attaining high 
"systemic" validity. 
INTRODUCTION 

Scores obtained on the Test of English as a Foreign Language 
(TOEFL) are frequently used to inform decisions regarding the 
readiness of nonnative speakers to pursue academic studies in 
English at colleges and universities in the United States and 
Canada. As in all measurement, the reliability of the test 
instrument and the dependability of decisions made on the basis 
of test performance are of major concern to test developers and 
test score users. The internal consistency reliability of the 
TOEFL total and individual test scores has been shown to be high 
(based on either a classical theory approach or an item response 
theory approach), and the associated standard error of 
measurement is published as a means for decision makers to 
estimate the probable extent of error inherent in the test scores 
(ETS, 1992, pp. 30-31). 

One useful extension of the classical theory approach to 
estimating the reliability of measurement was provided with 
the introduction of generalizability theory by Cronbach, 
Rajaratnam, and Gleser (1963). In their model, reliability 
"resolves into a question of the accuracy of generalization, or 
generalizability" (Cronbach, Gleser, Nanda, & Rajaratnam, 1972, 
p. 15), i.e., how well one can generalize from one observation to 
a universe of observations. Generalizability (G) theory views 
the observed score as if it were the universe score, generalizing 
from the sample to the universe of interest by means of specified 
estimation procedures (Shavelson & Webb, 1981, pp. 133-137). 
As Suen (1990, pp. 41-42) puts it, generalizability theory 
provides a conceptual framework to assess multiple sources and 
magnitudes of variation, or measurement error, within the context 
of a testing situation. In essence, analysis of variance (ANOVA) 
techniques are used to estimate components of variance associated 
with the various facets of measurement in a generalizability study 
(G-study). The ability to examine the sources of error in a 
multifaceted way provides a more comprehensive and differentiated 
explanation of variance than is possible in classical reliability 
theory (Shavelson & Webb, 1981, p. 133). This information can be 
utilized in a decision study (D-study) wherein the results of 
various measurement designs can be manipulated. Test-design and 
score-use decisions can then be made that are based on a more 
accurate estimation of the error inherent in such choices. In 
turn, the dependability (analogous to reliability in classical 
theory) of such decisions can also be examined. All of these G- 
study and D-study techniques are amply demonstrated and 
exemplified for various statistical designs in Brennan (1983). 

Application of G theory to language testing situations is 
discussed in Bolus, Hinofotis, and Bailey (1982), who 
reiterate the usefulness of this systematic approach to the study 
of measurement error. Brown (1984) applied G theory to the study 
of numbers of items and passages used in measuring engineering 
English reading ability in EFL situations. Then Brown and Bailey 
(1984) studied the effect of numbers of raters and scoring 
categories on the dependability of writing scores. More 
recently, Stansfield and Kenyon (1992) applied G theory to the 
study of the effect of numbers of tests and raters on oral 
proficiency interview scores. Brown (1990, 1993) also applied G 
theory to the problems of estimating score dependability in 
criterion-referenced language tests. 

Purpose 

The purpose of this project is to explore two dimensions of 
the TOEFL that have hitherto received little attention. First, a 
test development policy issue will be addressed. This issue 
centers on deciding how many items and subtests to include on the 
TOEFL for maximum effectiveness. Formulas like the Spearman- 
Brown prophecy formula can be used to predict the effects on test 
reliability of different numbers of items. But such formulas 
cannot help in determining the optimal combination of numbers of 
subtests and items that ought to be included on the TOEFL. 
Fortunately, generalizability theory, discussed above, is 
particularly well suited to addressing this issue. While this 
project was primarily designed to investigate the TOEFL test as 
it exists, it is possible within the generalizability theory 
framework to also include analyses that allow the results to be 
generalized to future versions of the TOEFL (e.g., TOEFL 2000) 
and to other test development projects around the world. 
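
For reference, the Spearman-Brown prophecy formula mentioned above can be written as a small function. This is a minimal sketch of the standard formula, not code from the study; the function name and the example values are illustrative assumptions.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability when test length is multiplied by length_factor.

    Standard Spearman-Brown prophecy formula:
        r_new = k * r / (1 + (k - 1) * r)
    where k is the ratio of the new length to the original length.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Illustrative values only (not taken from the study):
# a test with reliability .80 is doubled, then halved, in length.
doubled = spearman_brown(0.80, 2.0)
halved = spearman_brown(0.80, 0.5)
```

The formula treats every added item as interchangeable with the existing ones, so it cannot separate the effect of adding items from the effect of adding subtests. That is the gap the generalizability analyses are meant to fill.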

Second, while Educational Testing Service (ETS) has long 
reported the standard error of measurement (SEM) for the TOEFL to 
help score users make responsible decisions, there is one issue 
that continues to be potentially troublesome: in general, tests 
are not equally reliable for making decisions at different cut 
points (for an overview see Feldt & Brennan, 1989, pp. 123-124). 
Conditional SEM data provided by the TOEFL test analysis reports 
indicate that the SEM is not currently the highest at the mean 
of the TOEFL test. However, since the dependability of the 
scores has been found to be lowest at the mean elsewhere 
(Brennan, 1984, pp. 312-317), and since the dependability of the 
TOEFL along the entire range of possible decision points has not 
been demonstrated, cut-point dependability seems like an 
important, yet unresolved, issue. The second general goal of 
this project, then, is to determine whether differences in 
dependability exist at different cut points for the total TOEFL 
scores (or the individual Listening Comprehension, Structure and 
Written Expression, and Reading Comprehension and Vocabulary test 
scores that make up the battery) and to examine the degree to 
which any such differences may affect the dependability and 
therefore the validity of score users' decisions. 

To achieve the above goals, four research questions were 
formulated. These research questions apply not only to the 
overall TOEFL battery, but also to the various tests and sections 
that it includes: 

1. What are the classical theory reliability estimates? 

2. What are the relative contributions to error variance of 
persons, items, subtests, and their interactions? 

3. What is the dependability for varying numbers of items and 
subtests? 

4. What is the effect on score dependability of various cut 
points? 
METHODS 

Subjects 

The subjects in this study all come from the May 1991 
administration of the TOEFL. That administration included a 
total of 93,960 examinees with 26,371 in the United States and 
Canada and 67,589 at other test centers around the world. In 
fall 1992, the International Testing and Training Programs Area 
at ETS made available a data set (known as the "Generic Data 
Set"), which was made up of 24,500 item response records from the 
May 1991 administration of the worldwide TOEFL. For the project 
reported here, a total of 20,000 students were randomly selected 
(from the 24,500 records in the generic data set) for convenience 
in analyzing the results. 

Of the 20,000 subjects in this study, 59.6 percent were male 
and 40.4 percent were female. They were involved in both domestic 
(26.2%) and foreign (73.8%) administrations of the TOEFL. They 
reported themselves to be from a total of 144 different countries 
including Brazil (3.1%), Cyprus (2.8%), France (6.0%), Germany 
(4.7%), Greece (3.7%), India (4.3%), Indonesia (8.2%), Japan 
(8.2%), Jordan (1.6%), Republic of Korea (8.1%), Lebanon (1.2%), 
Malaysia (4.2%), Mexico (1.3%), Pakistan (3.7%), People's 
Republic of China (5.1%), Saudi Arabia (1.2%), Spain (1.9%), 
Switzerland (1.1%), Taiwan (2.3%), Thailand (8.3%), Turkey 
(5.7%), and 123 other countries with one percent or less each 
(13.3%). 

In terms of language background, the subjects reported 
themselves as being speakers of Arabic (8.3%), Chinese (8.0%), 
French (8.0%), German (6.2%), Greek (6.2%), Indonesian (8.2%), 
Japanese (8.2%), Korean (8.2%), Malay (4.1%), Portuguese (4.1%), 
Spanish (8.1%), Telugu (4.0%), Thai (8.3%), Turkish (6.1%), and 
Urdu (4.1%). 

Their reasons for taking the TOEFL varied too, as follows: 
for undergraduate studies (37.0%), for graduate studies (46.2%), 
for another type of school (2.0%), for a license (1.8%), for a 
company (8.5%), other (3.5%), and no reason given (1.0%). 

Materials 

As pointed out in ETS publications (e.g., ETS, 1992, 1993), 
the TOEFL test battery consists of three separately timed tests 
in multiple-choice format with four answer options for each test 
question printed in a test book. All responses are gridded on 
answer sheets that are later computer scored. 

The first test, Listening Comprehension (LC Test), is 
designed to measure the ability to understand spoken English. 
The first part (LC1) requires the examinees to listen to a short 
sentence and to choose the option that is closest to it in 
meaning. The second part (LC2) consists of short conversations 
between two people, followed by a spoken question. The examinee 
decides which option best answers the question. Part 3 (LC3) 
presents several short talks and extended conversations about a 
variety of subjects, and requires the examinees to respond to 
oral questions about what they heard. 

The second test, Structure and Written Expression (SWE 
Test), is designed to measure the ability to recognize selected 
points of English structure. In the first part of this test 
(SWE1), the examinee reads an incomplete sentence and must choose 
the word or phrase that best completes it. In the second part 
(SWE2), several words or phrases are underlined in a sentence, 
and the examinees must choose the underlined segment that is not 
an acceptable English usage. 

The third test, Vocabulary and Reading Comprehension (VRC 
Test), was designed to test the ability to understand the meaning 
and use of words as well as the ability to comprehend a variety 
of reading materials. The first part (VRC1) of this test 
contains vocabulary items wherein a word or phrase is underlined 
in a sentence and the examinee must select a word or phrase that 
could be substituted and still preserve the original meaning of 
the sentence. In the second part (VRC2), the examinee reads a 
number of short passages on a variety of academic subjects and 
must answer questions based on what is stated or implied in the 
passage. 

Procedures 

The TOEFL being used here was administered under standard 
conditions in May 1991. Strict admission procedures were 
followed, and, during the test, examinees were not allowed to 
have anything other than the testing materials on their desks. 
They were not permitted to take notes or make marks of any kind 
in their test books. Nor were they permitted to work on any 
section of the test before or after time was called. 

After the administration, answer sheets were returned to 
Educational Testing Service (ETS) for scoring. The raw scores 
for each Test are the number of questions answered correctly. 
There is no penalty for guessing. Raw scores are then converted 
to standardized scales based on the three-parameter item response 
theory model (T scores for the individual tests and CEEB scores 
for the battery as a whole). These scaled scores are reported to 
the examinees and to institutions that the examinees have 
selected to receive scores. 

Analyses 

The analyses in this project began with descriptive 
statistics and classical theory reliability estimates (split-half 
adjusted, Guttman, and Cronbach alpha) to provide background and 
a context for interpreting the generalizability studies (G- 
studies) and decision studies (D-studies). 

[INSERT FIGURE 1 ABOUT HERE] 

Five G-studies were conducted based on the overall structure 
of the TOEFL shown in Figure 1. The first G-study investigated 
the effects on the Total TOEFL battery scores dependability of 
numbers of items (items facet) and numbers of test types 
(subtests facet based on the Listening Comprehension, Structure 
and Written Expression, and Reading Comprehension and Vocabulary 
tests) as shown in Figure 2A. The second, third, and fourth G- 
studies considered the effects on total test scores for the 
Listening Comprehension Test (LC Test), Structure and Written 
Expression Test (SWE Test), and Reading Comprehension and 
Vocabulary Test (VRC Test) of numbers of items and subtests (made 
up of different item types) on those tests as shown in Figures 2B 
through 2D. The fifth G-study focused on the Reading 
Comprehension section (VRC2) of the VRC Test. In this case, the 
effects of numbers of items and subtests (passages P1-P5) were 
investigated as shown in Figure 2E. 

[INSERT FIGURES 2A-2E ABOUT HERE] 

All of these G-studies were very similar in structure. In 
all cases, analysis of variance (ANOVA) procedures were run 
using all 20,000 subjects for a persons by items nested within 
subtests design, or p x (i:s). The result in all cases was a 
two-facet design with items and subtests as the facets. Random 
effects models were used in the ANOVAs so that the results would 
be generalizable to the development of the TOEFL 2000 project as 
well as to other test development projects around the world. 
However, in some places mixed model ANOVAs with fixed effects for 
the subtest facet were also used so that the results for the 
current configuration of the TOEFL could be examined. 
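
The ANOVA step for a balanced p x (i:s) design can be sketched in plain Python as follows. This is an illustrative reconstruction of the standard sums-of-squares decomposition, not the authors' software; the function name and the toy data are assumptions.

```python
from itertools import product


def anova_p_x_i_in_s(x):
    """Sums of squares, df, and mean squares for a balanced p x (i:s) design.

    x[p][s][i] is the score of person p on item i of subtest s.
    """
    n_p, n_s, n_i = len(x), len(x[0]), len(x[0][0])
    cells = list(product(range(n_p), range(n_s), range(n_i)))

    # Marginal means needed for each effect.
    grand = sum(x[p][s][i] for p, s, i in cells) / len(cells)
    m_p = [sum(sum(row) for row in x[p]) / (n_s * n_i) for p in range(n_p)]
    m_s = [sum(sum(x[p][s]) for p in range(n_p)) / (n_p * n_i)
           for s in range(n_s)]
    m_is = [[sum(x[p][s][i] for p in range(n_p)) / n_p for i in range(n_i)]
            for s in range(n_s)]
    m_ps = [[sum(x[p][s]) / n_i for s in range(n_s)] for p in range(n_p)]

    ss = {
        "p": n_s * n_i * sum((m - grand) ** 2 for m in m_p),
        "s": n_p * n_i * sum((m - grand) ** 2 for m in m_s),
        "i:s": n_p * sum((m_is[s][i] - m_s[s]) ** 2
                         for s in range(n_s) for i in range(n_i)),
        "ps": n_i * sum((m_ps[p][s] - m_p[p] - m_s[s] + grand) ** 2
                        for p in range(n_p) for s in range(n_s)),
        "pi:s": sum((x[p][s][i] - m_ps[p][s] - m_is[s][i] + m_s[s]) ** 2
                    for p, s, i in cells),
    }
    df = {
        "p": n_p - 1,
        "s": n_s - 1,
        "i:s": n_s * (n_i - 1),
        "ps": (n_p - 1) * (n_s - 1),
        "pi:s": (n_p - 1) * n_s * (n_i - 1),
    }
    ms = {effect: ss[effect] / df[effect] for effect in ss}
    return ss, df, ms


# Made-up data: 3 persons, 2 subtests, 2 items per subtest (1 = correct).
toy = [
    [[1, 0], [1, 1]],
    [[0, 0], [1, 0]],
    [[1, 1], [1, 1]],
]
ss, df, ms = anova_p_x_i_in_s(toy)
```

Because the design is balanced, the five sums of squares add up to the total sum of squares, which is a convenient check on any implementation.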

In the random effects model, it is assumed that persons, 
items, and subtests were randomly selected from the universes of 
all possible persons, items, and subtests. Shavelson and Webb 
(1981) argued that random effects models are reasonable if one 
can take an exchangeability perspective: 

Viewed from the exchangeability perspective, the issue of 
fixed or random effects is not whether one can catalog (etc.) 
all possible members of a population but whether the members 
are exchangeable with other potential members. In terms of 
sampling, if one set of persons and items to which $\rho^2$ 
[generalizability coefficient] . . . is generalizable is the set of 
such persons and items jointly exchangeable with the present 
sample, it is reasonable to consider the item facet random. 
The concept of exchangeability, at the minimum, provides 
reasonable grounds for considering whether a facet is random 
or fixed. 

Thus for those results in this paper that are based on a random 
effects model, random selection of items and subtests is assumed, 
while, for those results based on a mixed model (with fixed 
effects for subtests), no such assumption is made for the 
subtests facet. 

Based on the mean squares obtained in the random effects 
model ANOVA procedures, variance components were estimated (as 
will be demonstrated in the RESULTS section). Interpreting these 
variance components helped in understanding the relative 
contribution of persons to the true score variance, as well as 
the contributions of items and subtests to the error variance. 

Five parallel D-studies followed the G-studies. In these D- 
studies, the variance components found in the G-studies were used 
to calculate statistics that can be directly interpreted in 
making decisions. Two types of error were considered: a) lower- 
case delta error ($\delta$) for relative decisions (i.e., norm- 
referenced decisions), and b) upper-case delta error ($\Delta$) for 
absolute decisions (i.e., criterion-referenced decisions). All 
relevant D-study statistics are reported in the RESULTS section for the 
combination of items and tests under investigation in this 
project. In addition, G coefficients (based on lower-case delta) 
are reported for various numbers of items and subtests so that 
the reader can directly observe the effect on dependability of 
these two facets in various combinations of numbers of items and 
subtests. 

The last step in each D-study was to calculate a squared- 
error loss agreement coefficient known as phi(lambda), or 
$\Phi(\lambda)$, at various cut points from 10 percent to 90 percent. These 
analyses illustrate the effect of various cut points on decision 
dependability. Phi(lambda) coefficients were calculated for both 
a random effects model (to provide generalizability of results to 
other tests) and a mixed model with fixed effects for subtests 
(to provide estimates for the TOEFL as it existed in this study). 
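
A phi(lambda) calculation of this kind can be sketched as below. The sketch assumes the commonly cited form of the coefficient, in which the squared distance between the mean and the cut score is added to the universe-score variance, with an optional correction for sampling error in the mean; the function name, argument names, and input values are illustrative and are not taken from the study.

```python
def phi_lambda(var_p, var_abs_error, mean, cut, var_mean=0.0):
    """Estimate phi(lambda) at a cut score, on the proportion-score scale.

    var_p          universe-score (persons) variance
    var_abs_error  absolute error variance, sigma^2(Delta)
    mean           observed grand mean proportion score
    cut            cut score lambda, as a proportion
    var_mean       error variance of the grand mean; subtracting it corrects
                   (mean - cut)^2 for sampling error in the mean
    """
    distance_sq = (mean - cut) ** 2 - var_mean
    signal = var_p + distance_sq
    return signal / (signal + var_abs_error)


# Made-up inputs for illustration only; the cut points mirror the
# 10-to-90-percent range described in the text.
cuts = [c / 10 for c in range(1, 10)]
curve = {cut: phi_lambda(0.03, 0.004, 0.60, cut) for cut in cuts}
```

With `var_mean` left at zero, the coefficient is smallest when the cut score equals the mean and rises as the cut score moves away from it. This is why dependability at the mean is the case of greatest concern.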

RESULTS 

The results of this project will be discussed in the 
following stages with commensurate section headings: a) 
descriptive statistics for each of the five generalizability 
studies will be provided for background; b) classical theory 
reliability estimates will be presented for later comparison with 
the G-theory results; c) the variance components for the five G- 
studies will be presented and compared; d) the five parallel D- 
study results will be presented along with G coefficients for 
various numbers of items and subtests; and finally, e) threshold- 
loss agreement coefficients will be given for different cut 
points within each of the D-studies. 

Descriptive Statistics 

The descriptive statistics for the raw scores involved in 
each of the five generalizability studies are reported in Table 
1. According to the labels across the top of the table, the 
mean, standard deviation (SD), and number of items (k) are given 
for the original test and for the G-study sampling. The original 
test includes the subtests and numbers of items just as they were 
administered. The G-study sampling results are based on the 
random samples that were taken from the original test to create 
balanced subtests (each containing the same number of items) for 
the generalizability studies. 

[INSERT TABLE 1 ABOUT HERE] 

The first G-study was on the effects of items and tests on 
the dependability of Total TOEFL battery scores. Thus 
descriptive statistics are given for the Total TOEFL and each of 
the tests which contribute to that total score: Listening 
Comprehension (LC Test), Structure and Written Expression (SWE 
Test), and Reading Comprehension and Vocabulary (VRC Test). 
Notice that the original TOEFL had a total of 146 items and that 
the original LC, SWE, and VRC tests had 50, 38, and 58 items, 
respectively. In order to create a balanced design, two of the 
tests had to be reduced in number of items to match the smallest 
of the tests. To achieve this, 38 items were randomly selected 
from each of the LC and VRC tests to match the existing 38 items in the 
SWE Test. As a result, in the first G-study, all three Tests 
were analyzed as 38-item tests with a TOEFL total of 114 items. 
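
The balancing step can be illustrated with a short sketch. The item counts come from the text, but the item identifiers, the function name, and the seed are placeholders. The study does not report how the random draw was implemented.

```python
import random


def balance_subtests(items_by_subtest, seed=0):
    """Randomly keep as many items per subtest as the shortest subtest has."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    k = min(len(items) for items in items_by_subtest.values())
    return {
        name: sorted(rng.sample(items, k))
        for name, items in items_by_subtest.items()
    }


# Item counts from the text; the item IDs themselves are placeholders.
original = {
    "LC": list(range(1, 51)),   # 50 items
    "SWE": list(range(1, 39)),  # 38 items
    "VRC": list(range(1, 59)),  # 58 items
}
balanced = balance_subtests(original)
total_items = sum(len(items) for items in balanced.values())
```

The shortest subtest is kept intact, so only the longer subtests lose items.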

The second G-study was focused on the effects of items and 
subtests on the dependability of LC Test scores. Thus 
descriptive statistics are given in Table 1 for the whole LC Test 
and each of the subtests which contribute to the LC Test scores: 
LC1, LC2, and LC3. Notice that the original LC Test had a total 
of 50 items and that the original LC1, LC2, and LC3 sections had 
20, 15, and 15 items, respectively. In order to create a 
balanced design, the longer section had to be reduced in number 
of items to match the other two sections. To achieve this, 15 
items were randomly selected from LC1 to match the existing 
15 items in both the LC2 and LC3 sections. As a result, in the 
second G-study, all three sections were analyzed as 15-item 
subtests with an LC Test total of 45 items. 

The third G-study was on the effects of items and subtests 
on the dependability of SWE Test scores. Thus descriptive 
statistics are given for the whole SWE Test and each of the two 
sections which contribute to the SWE Test scores: SWE1 and SWE2. 
Notice that the original SWE Test had a total of 38 items and 
that the original SWE1 and SWE2 sections had 14 and 24 items, 
respectively. In order to create a balanced design, 14 items 
were randomly selected from the SWE2 section to match the 
existing 14 items in the SWE1 section. As a result, in the third 
G-study, the two sections were analyzed as 14-item subtests with 
a SWE Test total of 28 items. 

The fourth G-study was on the effects of items and subtests 
on the dependability of VRC Test scores. Thus descriptive 
statistics are given for the whole VRC Test and each of the two 
sections which contribute to the VRC Test scores: Vocabulary and 
Reading Comprehension. Notice that the original VRC Test had a 
total of 58 items and, since each of the sections had 29 items, 
it was already balanced. Thus no modifications were necessary in 
preparing it for the fourth G-study. 

The fifth G-study was on the effects of items and passages 
within the Reading Comprehension section (VRC2) on the 
dependability of VRC2 section scores. Thus descriptive 
statistics are given for the whole VRC2 section and the items 
associated with each of the passages which contributed to the 
VRC2 section scores: Passages 1 to 5. Notice that the original 
VRC2 section had a total of 29 items and that the original 
passages had 7, 5, 7, 6, and 4 items associated with them, 
respectively. In order to create a balanced design, the passages 
with larger numbers of items had to have the number of items 
reduced to match the shortest passage (i.e., Passage 5 with four 
items). To achieve this, four items were randomly selected from 
those associated with each of the larger passages. As a result, 
in the fifth G-study, all five passages were analyzed as 
four-item sections with a VRC2 section total of 20 items. 

Classical Theory Reliability 

Classical theory reliability estimates are presented in 
Table 2. For ease of interpretation, Table 2 is organized in the 
same general manner as Table 1. The first classical theory 
reliability estimate given is the split-half correlation adjusted 
by the Spearman-Brown prophecy formula. Then the Guttman 
reliability is given followed by the Cronbach alpha coefficient. 
Notice that the first two estimates are consistently lower than 
the Cronbach alpha coefficients. Since theory indicates that the 
first two are more likely to be underestimates, the single best 
estimate is the Cronbach alpha. These estimates are given for 
the Original Tests and the G-study Samplings (along with the 
numbers of items, or k) so that the effect of the reductions in 
test length on classical theory reliability can readily be seen. 
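
The three classical estimates named above can be sketched as follows. The study does not say how the halves were formed or which Guttman coefficient was reported. The odd/even split and the Guttman split-half form used here are therefore assumptions, as are the function names and the toy data.

```python
from statistics import pvariance


def _pearson(a, b):
    """Pearson correlation between two equal-length score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5


def _halves(rows):
    """Odd/even split: half scores on items 1, 3, 5... and 2, 4, 6..."""
    return [sum(r[0::2]) for r in rows], [sum(r[1::2]) for r in rows]


def split_half_adjusted(rows):
    """Half-test correlation stepped up to full length (Spearman-Brown)."""
    a, b = _halves(rows)
    r = _pearson(a, b)
    return 2 * r / (1 + r)


def guttman_split_half(rows):
    """Guttman split-half coefficient: 2 * (1 - (var_a + var_b) / var_total)."""
    a, b = _halves(rows)
    total = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (pvariance(a) + pvariance(b)) / pvariance(total))


def cronbach_alpha(rows):
    """Cronbach alpha from a persons-by-items matrix of item scores."""
    k = len(rows[0])
    item_vars = sum(pvariance([r[j] for r in rows]) for j in range(k))
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_vars / total_var)


# Made-up responses: 5 persons by 4 items (1 = correct).
rows = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
estimates = {
    "split_half": split_half_adjusted(rows),
    "guttman": guttman_split_half(rows),
    "alpha": cronbach_alpha(rows),
}
```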

[INSERT TABLE 2 ABOUT HERE] 

Variance Components 

Based on ANOVA procedures (shown in Appendix A), G theory 
allowed for estimation of the relative contributions of persons, 
items, and subtests in terms of variance components. For 
example, for the first G-study of the Total TOEFL Battery, which 
was a p x (i:s) design (like all of the others), the ANOVA 
results are shown in Table 3. 

[INSERT TABLE 3 ABOUT HERE] 

Based on the variance components that make up the expected 
mean squares (EMS) as shown in Brennan (1983) or Kirk (1968), the 
variance components for persons as well as for the items and 
subtests facets were isolated from the observed mean squares 
(MS). The EMS shown in Table 3 were used systematically to 
derive the variance components in a step-by-step manner. First, 
because the estimated variance component for the interaction of 
persons and items nested within subtests, or $\hat{\sigma}^2(pi{:}s)$, is equal 
to the MS(pi:s) for that interaction, .16180465 in this case, 
that variance component is easy to isolate. Formulaically, this 
process can be summarized as follows: 

$$\hat{\sigma}^2(pi{:}s) = MS(pi{:}s)$$

Second, because, as is shown in Table 3, it is known that 
the EMS for the ps interaction = $\sigma^2(pi{:}s) + n_i \sigma^2(ps)$, the 
estimated variance component for this interaction, $\hat{\sigma}^2(ps)$, could 
be isolated by subtracting the MS(pi:s) from the MS(ps), and 
dividing the result by the number of items, $n_i$, in each subtest 
[i.e., (.43545427 - .16180465)/38 in this case]. Formulaically: 

$$\hat{\sigma}^2(ps) = \frac{MS(ps) - MS(pi{:}s)}{n_i}$$

Third, fourth, and fifth, using the known mathematical 
relationships shown in Table 3, the other three variance 
components in this design could then be calculated by using the 
following formulas: 

$$\hat{\sigma}^2(p) = \frac{MS(p) - MS(ps)}{n_i n_s}$$

$$\hat{\sigma}^2(i{:}s) = \frac{MS(i{:}s) - MS(pi{:}s)}{n_p}$$

$$\hat{\sigma}^2(s) = \frac{MS(s) - MS(i{:}s) - MS(ps) + MS(pi{:}s)}{n_p n_i}$$

Note that the calculations in this example were based on MS 
values that have been rounded to eight places. Because the 
resulting variance components are often very small values, it was 
essential that nothing be rounded any more than was necessary 
until the final result was obtained. 
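
The five steps above can be collected into one function. This is a sketch of the same algebra, not the authors' program; the function and key names are assumptions. Only the two mean squares quoted in the text are used in the worked line at the end.

```python
def variance_components(ms, n_p, n_s, n_i):
    """Solve the expected-mean-square equations of a p x (i:s) design.

    ms maps each effect ("p", "s", "i:s", "ps", "pi:s") to its mean square.
    Negative estimates are returned as-is; no truncation to zero is applied.
    """
    return {
        "pi:s": ms["pi:s"],
        "ps": (ms["ps"] - ms["pi:s"]) / n_i,
        "p": (ms["p"] - ms["ps"]) / (n_i * n_s),
        "i:s": (ms["i:s"] - ms["pi:s"]) / n_p,
        "s": (ms["s"] - ms["i:s"] - ms["ps"] + ms["pi:s"]) / (n_p * n_i),
    }


# The one step the text works through numerically
# (Total TOEFL G-study, 38 items per subtest).
sigma2_ps = (0.43545427 - 0.16180465) / 38
```

Carrying full precision through to the end, as the text advises, matters here because the person-by-subtest component is two orders of magnitude smaller than the mean squares it is derived from.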

[INSERT TABLE 4 ABOUT HERE] 

The variance components for each of the G-studies in this 
project (all calculated in similar manner) are shown in Table 4. 
Notice that the five G-studies are labeled across the top as 
columns and that the sources of variance (p, s, i:s, ps, and 
pi:s) are labeled at the left as rows. The totals in the last 
row represent the sums of the variance components isolated in 
each study. 

D-Study Results and Generalizability Coefficients 

Summaries of the statistics found in the five D-studies are 
presented in Table 5. Notice that each D-study is presented in a 
separate column as labeled across the top of the table. The rows 
represent each of the statistics. First, the number of subtests 
($n_s$) is given, then the number of items per subtest ($n_i$), then 
the total number of items (the number of subtests 
multiplied by the number of items per subtest). Then the 
estimated variance components (adjusted for the number of items 
and subtests in the particular D-study) are given for p, s, i:s, 
and their interactions. Notice that the variance components for 
p are the same as those reported in Table 4, while the variance 
components for s, i:s, and their interactions are different in 
the two tables because those in Table 5 have been adjusted for 
the numbers of items or subtests in the particular D-study design 
(after Brennan, 1983). In the next row, the mean proportion 
scores ($\bar{X}_p$) are given. These means are simply the average of 
each person's proportion score, which is calculated by dividing 
the number of correct responses by the number of items (but not 
moving the decimal two places to the right as would be done in 
calculating a percent score). 

[INSERT TABLE 5 ABOUT HERE] 

Next, statistics are given for a random effects model. The 
random effects model estimates allow generalization of the 
results to other tests as discussed above. The statistics for 
this model include $\hat{\sigma}^2(\tau)$, which is just another expression of 
$\hat{\sigma}^2(p)$. The lower-case delta error term, $\hat{\sigma}^2(\delta)$ (for relative 
decisions, i.e., norm-referenced interpretations), and the upper- 
case delta error term, $\hat{\sigma}^2(\Delta)$ (for absolute decisions, i.e., 
criterion-referenced or domain-referenced interpretations), are 
also given. Then the expected observed score variance, $E\hat{\sigma}^2(X)$, 
and error variance associated with the grand mean, $\hat{\sigma}^2(\bar{X})$, are 
presented. All of these statistics were used in calculating the 
generalizability coefficients for lower-case delta (norm- 
referenced) error, $E\hat{\rho}^2(\delta)$, in the S/N ratios reported in this 
table, or in the phi(lambda) coefficients reported in the next 
section. 

The G-coefficients, $E\rho^2(\delta)$, that are presented in Table 5 are
analogous in interpretation to reliability coefficients. They
are calculated by forming a ratio of the persons variance
component for the particular number of subtests and items in the
G-study over the same persons variance plus the appropriate error
term. Thus G-coefficients for relative decisions would use $\delta$
error as follows:

$$E\rho^2(\delta) = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)} = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)}$$

Similarly, G-coefficients for absolute decisions would use $\Delta$
error as follows:

$$E\rho^2(\Delta) = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)} = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\Delta)}$$
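
As a worked illustration, the Study One random effects values can be
recovered from the Table 4 components with 3 subtests of 38 items
each. The error-variance expressions below are an assumption: they
are the usual random-effects definitions for a p x (i:s) design and
are not spelled out in this section.

```latex
\begin{align*}
\sigma^2(\delta) &= \frac{\sigma^2(ps)}{n_s} + \frac{\sigma^2(pi{:}s)}{n_s n_i}
  = \frac{.00720131}{3} + \frac{.16180465}{114} \approx .0038 \\
\sigma^2(\Delta) &= \sigma^2(\delta) + \frac{\sigma^2(s)}{n_s} + \frac{\sigma^2(i{:}s)}{n_s n_i}
  \approx .0038 + \frac{.00247421}{3} + \frac{.01098074}{114} \approx .0047 \\
E\rho^2(\delta) &= \frac{.03140424}{.03140424 + .00381978} \approx .8916
\end{align*}
```

These figures agree with the random effects entries for Study One in
Table 5.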

The last statistic presented in the Random Effects part of
Table 5 is the signal to noise ratio (S/N). This statistic can
be interpreted as the ratio of systematic variance to random
error (Brennan & Kane, 1977), or, as Cronbach and Gleser (1964:
468) put it in an earlier discussion of communications systems,
the "signal to noise ratio compares the strength of the
transmission to the strength of the interference."
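
The relationship between the S/N ratio and the G-coefficient can be
sketched as follows. This is a minimal illustration, assuming that
S/N is the universe score variance divided by the relative error
variance, so that S/N equals the G-coefficient divided by one minus
the G-coefficient. The function name `signal_to_noise` is
hypothetical; the inputs are the Study One coefficients as printed in
Table 5.

```python
def signal_to_noise(g_coefficient):
    """S/N implied by a G-coefficient: E_rho2 / (1 - E_rho2)."""
    if not 0.0 <= g_coefficient < 1.0:
        raise ValueError("G-coefficient must lie in [0, 1)")
    return g_coefficient / (1.0 - g_coefficient)

# Study One G-coefficients for delta error, as printed in Table 5.
sn_random = signal_to_noise(0.8916)   # random effects model
sn_mixed = signal_to_noise(0.9597)    # mixed effects model
print(f"{sn_random:.4f} {sn_mixed:.4f}")
```

Under this assumption the two results match the S/N values tabled
for Study One (8.2251 and 23.8139) to three decimal places.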

At the bottom of the table, the same statistics are
presented for a mixed effects model (with subtests as a fixed
effect). These results can only be generalized to the TOEFL
battery as it was structured and studied here.

Notice that, as would be expected, the generalizability (or
G) coefficients [$E\rho^2(\delta)$] for the mixed model are very similar to
the Cronbach alpha values reported in Table 2 for the G-study
sampling (.9584, .9077, .8686, .9326, and .8280, respectively).
In addition, probably because of differences in numbers of items,
these G-coefficients are slightly lower than the corresponding
Cronbach alpha values reported in Table 2 for the Original test
(.9667, .9178, .9016, .9326, and .8769, respectively).

Naturally, the G-coefficients for the random effects model
are more conservative than those for the mixed model because the
random effects statistics can be generalized beyond the items and
subtests of the current TOEFL to other batteries and tests.
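
The contrast between the two models can be sketched numerically. The
snippet below is an illustration rather than the authors' procedure:
it assumes that, with subtests fixed, the persons-by-subtests
component (divided by the number of subtests) moves from relative
error into universe score variance. The function names and
dictionary keys are hypothetical; the inputs are the Study One
components from Table 4.

```python
def g_coefficient_random(g, n_s, n_i):
    """E_rho^2(delta) with subtests and items both random."""
    rel_error = g["ps"] / n_s + g["pi:s"] / (n_s * n_i)
    return g["p"] / (g["p"] + rel_error)

def g_coefficient_mixed(g, n_s, n_i):
    """E_rho^2(delta) with subtests fixed and items random."""
    universe = g["p"] + g["ps"] / n_s      # ps joins universe score variance
    rel_error = g["pi:s"] / (n_s * n_i)    # only the residual remains as error
    return universe / (universe + rel_error)

# Study One (Total TOEFL) components from Table 4.
study_one = {"p": 0.03140424, "ps": 0.00720131, "pi:s": 0.16180465}

random_g = g_coefficient_random(study_one, n_s=3, n_i=38)
mixed_g = g_coefficient_mixed(study_one, n_s=3, n_i=38)
print(f"random: {random_g:.4f}  mixed: {mixed_g:.4f}")
```

Under these assumptions the two values round to .8916 and .9597, the
Study One entries in Table 5, with the random effects coefficient the
more conservative of the two.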



Tables 6 to 10 were created by expanding this random effects G- 
coefficient information. Each of these tables corresponds to one 
of the D-studies in this project and gives the coefficients that 
would arise from different numbers of items and subtests. 

[INSERT TABLE 6 ABOUT HERE] 
For instance, Table 6 is for the Total TOEFL battery and
shows that the G-coefficient for 3 subtests with 38 items each
(see the point where the 38th row and third column of
coefficients intersect) is .892, which is equivalent (though
rounded) to the random effects model G-coefficient of .8916
reported in Table 5. Notice in the bottom left corner of the
table that the battery configured with the same 114 items but in
one subtest instead of three is estimated to be dependable at
.785; with two subtests of 57 (total 114), it is predicted to be
.862; and, with three subtests of 38 (as shown at the top of this
paragraph), it would be .892. Thus the effects of having the
items divided up into smaller and smaller subtests are
demonstrated.

Clearly, there is considerable dependability gained from 
having the TOEFL battery made up of three different subtests 
rather than of one long, homogeneous test. In other words, there 
is an increase in dependability due to increases in the number of 
subtests involved while holding the number of items constant. 
Moreover, these increases are above and beyond predictions that 
could be made by using formulas like the Spearman-Brown prophecy 
formula used in classical theory reliability studies. 
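
The kind of projection tabulated in Table 6 can be sketched as below.
This is an illustration under stated assumptions, not the authors'
program: it takes the Study One variance components from Table 4,
treats subtests and items as random, and recomputes the G-coefficient
for several ways of arranging 114 items. The function name
`g_coefficient` is hypothetical.

```python
def g_coefficient(var_p, var_ps, var_pis, n_s, n_i):
    """Random-effects E_rho^2(delta) for n_s subtests of n_i items."""
    rel_error = var_ps / n_s + var_pis / (n_s * n_i)
    return var_p / (var_p + rel_error)

# Study One (Total TOEFL) components from Table 4.
VAR_P, VAR_PS, VAR_PIS = 0.03140424, 0.00720131, 0.16180465

# Four ways of arranging the same 114 items.
layouts = [(1, 114), (2, 57), (3, 38), (6, 19)]
projected = {
    (n_s, n_i): g_coefficient(VAR_P, VAR_PS, VAR_PIS, n_s, n_i)
    for n_s, n_i in layouts
}
for (n_s, n_i), coef in projected.items():
    print(f"{n_s} subtest(s) x {n_i:3d} items: {coef:.3f}")
```

Because the total number of items is fixed, the residual term is the
same in all four layouts. The gain comes entirely from dividing the
persons-by-subtests component by a larger number of subtests, which a
projection based on test length alone would not capture.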



Table 6 also allows for considering other potential
combinations of numbers of items and subtests as part of the
D-study to help in deciding on the optimal number of items and
subtests to include in future versions of this and other tests.
For instance, by looking at the point where six subtests
intersect with 19 items (also for a total of 114 items), the
table reveals that a G-coefficient of .923 is predicted.

However, for actual policy decisions, factors other than
dependability must come into play. For instance, 100 tests with
seven items each are predicted to be dependable at .99, but such
a 700-item test is not practical even though the dependability
would be near perfect. Thus these dependability estimates for
various numbers of items and subtests are meant to provide one
piece of information among the many types of information that
must be considered in making test development decisions.

[INSERT TABLE 7 ABOUT HERE] 
Turning to Table 7 for the LC Test, notice that a single
45-item test would be dependable at .882, while a similar 45-item
test based on three subtests of 15 items each would only be
slightly more dependable at .899, and a 45-item test based on
five subtests of nine items each would only gain .004 points at
.903. Thus the payoff in terms of gains in dependability due to
increases in the number of subtests (while the number of items is
held constant) appears to be minimal for the LC Test.

[INSERT TABLE 8 ABOUT HERE] 
Similarly, in Table 8 for the SWE Test, a 28-item test with
only one subtest would be dependable at .836, while a similar
28-item test based on two subtests of 14 items each would only be
slightly more dependable at .851, four subtests of seven items
each would only be .859, seven subtests of four items each would
only be .862, and fourteen subtests of two items each would be
.865. In short, there is not nearly as much to gain by dividing
the SWE Test into subtests (certainly not beyond two subtests)
as there was in the Total TOEFL battery.

[INSERT TABLE 9 ABOUT HERE] 
Table 9 for the VRC Test is somewhat different. The table
seems to indicate that considerable dependability is gained by
splitting the 58 items into two subtests, i.e., the one-subtest,
58-item dependability is .858, while the two-subtest version (of
29 items each) has a considerably higher dependability of .893.
However, a three-subtest version (of 20 items each) would only
increase to .907 even though it is two items longer, and a
four-subtest version (of 15 items each) would only increase further
to .914. Thus, like the SWE Test results, it appears that the
present two-subtest version of the VRC Test may include as many
subtests as are necessary and practical.

[INSERT TABLE 10 ABOUT HERE] 
Table 10 for the VRC2 section is more like the Total TOEFL
in terms of the impact of subtests on dependability. For
instance, the one-subtest, 20-item dependability is .650, while
the two-subtest version (with 10 items each) is considerably
higher at .729, and the four-subtest version (with 5 items each)
climbs to .776. Thus differences in the numbers of passages
involved in the VRC2 section appear to be relatively important to
its overall dependability.

More D-study Results: Phi (lambda) Dependability Coefficients 

Threshold loss agreement coefficients focus on the degree to
which classifications in clear-cut categories have been
consistent. Since it is known that such dependability may vary
at different cut points (Brennan, 1980, 1984) and since TOEFL is
widely used as an admissions tool for admit/no-admit decisions
(though at different cut points), one of the research questions
in this study was the degree to which the dependability of TOEFL
changes over the range of possible cut points.

[INSERT TABLE 11 ABOUT HERE] 
Table 11 gives the Phi(lambda), or $\Phi(\lambda)$, coefficients for
various cut points (in percentage terms). In all cases, these
coefficients are based on the p x (i:s) design and $\Delta$ error (as
suggested by Brennan, 1984) and are therefore more conservative
than the $\delta$ error estimates would have been. Notice that such
coefficients are reported for both random effects models and
mixed models (with subtests as a fixed effects facet). In each
set, the lowest value reported was that for a cut point at the
mean. Hence the $\Phi(\lambda)$ values for the cut point at the mean,
$\Phi(\bar{X})$, are reported below all of the others in each type of
model, and the mean percentages (upon which the $\Phi(\lambda)$ values
are based) are given for reference.



To interpret this table, it is first necessary to decide
whether the results of interest are those that are generalizable
to other tests (Random Effects Model) or those that pertain
only to the present TOEFL items and subtests
(Mixed Effects Model). Consider the Mixed Effects Model for the
present TOEFL battery as a whole, presented in the bottom half of
the first column. Notice that $\Phi(\lambda)$ coefficients are presented
for decisions made at 10%, 20%, etc., up to 90%. Notice further
that the lowest of these is .957 at the 70% cut point. It turns
out here and in the other columns that the lowest value will be
that closest to the mean. In fact, decisions made at the mean
will generally turn out to be the least dependable. Hence, the
$\Phi(\bar{X})$ at the mean is presented along with that mean in the last
two rows of both the upper and lower portions of Table 11.
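
The behavior described above can be sketched with the usual estimator
of the phi (lambda) index from the generalizability literature:
$\Phi(\lambda) = 1 - \sigma^2(\Delta) / [\sigma^2(p) + (\bar{X} - \lambda)^2 - \sigma^2(\bar{X}) + \sigma^2(\Delta)]$.
The snippet is an illustration under stated assumptions, not the
authors' program: it uses the random effects model, the Study One
components from Table 4, the Study One mean proportion score from
Table 5, and a sample of 20,000 persons. Function and variable names
are hypothetical.

```python
def phi_lambda(cut, mean, var_p, var_abs, var_mean):
    """Estimated Phi(lambda) for a cut score on the proportion scale.

    (mean - cut)**2 - var_mean estimates the squared distance between
    the population mean and the cut score.
    """
    signal = var_p + (mean - cut) ** 2 - var_mean
    return signal / (signal + var_abs)

# Study One G-study components (Table 4) and D-study design (Table 5).
P, S, IS, PS, PIS = 0.03140424, 0.00247421, 0.01098074, 0.00720131, 0.16180465
N_P, N_S, N_I = 20000, 3, 38
MEAN = 0.6927                      # mean proportion score, Table 5

n_items = N_S * N_I
# Absolute error variance, sigma^2(Delta), random effects model.
var_abs = S / N_S + IS / n_items + PS / N_S + PIS / n_items
# Error variance of the grand mean, sigma^2(X-bar).
var_mean = (P / N_P + S / N_S + IS / n_items
            + PS / (N_P * N_S) + PIS / (N_P * n_items))

phi = {cut: phi_lambda(cut / 100, MEAN, P, var_abs, var_mean)
       for cut in range(10, 100, 10)}
phi_at_mean = phi_lambda(MEAN, MEAN, P, var_abs, var_mean)

for cut, value in sorted(phi.items(), reverse=True):
    print(f"{cut}%  {value:.3f}")
print(f"mean {phi_at_mean:.3f}")
```

Under these assumptions the values track the random effects column
for Study One in Table 11, and the coefficient at the mean is the
lowest of the set because the squared distance term vanishes there.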

DISCUSSION 

In interpreting the above results, it is important to 
remember that most of the dependability estimates (i.e., all 
except those found in the VRC Test analyses in Study Four) are 
based on fewer items than actually used in the tests because it 
was necessary to design the various studies so that there would 
be equal numbers of items on each subtest. Since shorter tests 
tend to be less reliable, the effect of these reduced numbers of 
items (if there is any) would be to provide low estimates of 
dependability. As a result, it is reasonable to interpret the 
results as conservative underestimates of the true state of 
affairs. In other words, if the dependability estimates are in
error, they will err on the low side and should not provide
overestimates of the dependability of these measures.
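
The effect of test length can be illustrated with the Spearman-Brown
prophecy formula from classical theory. This is only a rough sketch:
it assumes that the added items behave like the existing ones, which
need not hold for the TOEFL, and the function name `spearman_brown`
is hypothetical. The example projects the Cronbach alpha for the
114-item Study One G-study sampling (Table 2) to the 146 items of the
original battery.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

alpha_114 = 0.9584                 # Cronbach alpha, G-study sampling (Table 2)
projected_146 = spearman_brown(alpha_114, 146 / 114)
print(f"{projected_146:.4f}")
```

The projection comes to about .967, close to the alpha of .9667
reported for the original 146-item battery, which is consistent with
the reduced-length estimates erring on the low side.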

The remainder of this discussion will directly address the 
original research questions posed at the outset of this project. 
To help organize the discussion, the research questions will be 
used as headings.

What are the classical theory reliability estimates?

As reported elsewhere in the literature, the Total TOEFL
battery and its component tests (the LC Test, SWE Test, and VRC
Test) proved to be very reliable from a classical theory
perspective. The results in Table 2 indicate that these tests in
their existing form (labeled Original Test in the table) were
reliable at .97, .92, .90, and .93, respectively, using Cronbach
alpha. Predictably, the VRC2 section, which was only a portion
of the VRC Test, was less reliable at .8769 than the tests and
battery considered above because it is considerably shorter than
they are. For the sake of comparison, Table 2 also presents the
classical theory estimates for the items used in the G-study
sampling (done to create balanced designs). These Cronbach alpha
estimates later turned out to be comparable to the G-coefficients
(for $\delta$ error) for the mixed models, as would be expected.

What are the relative contributions to error variance of persons, 
items, subtests, and their interactions? 

Examining the variance components shown in Table 4 for the
five G-studies in terms of their relative magnitude reveals the
relative contributions of persons, subtests, and items nested
within subtests, as well as their interactions. For instance,
from inspection of the variance components themselves, it is
clear that the lion's share of variance in all of these studies
is taken up by persons and those interactions involving persons.
This is as it should be because the purpose of a norm-referenced
test is to differentiate among persons. However, it should be
noted that the variance component for the persons by subtests
interaction is far smaller than that for the persons by items
nested within subtests interaction, though the persons by
subtests interaction is fairly high in Study Five. It is also
true in all cases that the variance component due to items nested
within subtests is larger than the component for subtests, though
only marginally so in Study Five.
Particularly in Study Two (LC Test) and Study Three (SWE Test),
the subtests variance component is very small. The subtests
component is somewhat larger in Study Four (VRC Test). However,
in Study One (Total TOEFL), the variance component for subtests
is much more important, amounting to about one-twelfth of the
persons component and about one-quarter of the items nested
within subtests component. In Study Five (VRC2), the subtests
variance component is even more important since it is almost
one-fifth as large as the persons component and almost equal in
magnitude to the items nested within subtests component. These
observations will be further illuminated in the next section.
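
The comparisons above can be checked directly from Table 4. The
following sketch expresses the subtests component as a fraction of
the persons component and of the items nested within subtests
component for Studies One and Five. The variable names are
hypothetical; the values are copied from Table 4.

```python
# Variance components copied from Table 4.
table4 = {
    "Study One (Total TOEFL)": {"p": 0.03140424, "s": 0.00247421, "i:s": 0.01098074},
    "Study Five (VRC2)":       {"p": 0.03699441, "s": 0.00710587, "i:s": 0.00762986},
}

ratios = {}
for study, comp in table4.items():
    ratios[study] = {
        "s/p": comp["s"] / comp["p"],        # subtests relative to persons
        "s/i:s": comp["s"] / comp["i:s"],    # subtests relative to items:subtests
    }
    print(f"{study}: s/p = {ratios[study]['s/p']:.3f}, "
          f"s/i:s = {ratios[study]['s/i:s']:.3f}")
```

The Study One ratios come out near one-twelfth and one-quarter, and
the Study Five ratios just under one-fifth and just under one, in
line with the description above.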

What is the dependability for varying numbers of items and 
subtests? 

Tables 6 to 10 provided a multitude of direct answers to
this research question. In all cases, the subtests facet was
shown to have some influence on the predicted dependability
indices, as indicated by the fact that in no D-study was the
dependability the same for one subtest and more than one subtest
with the number of items held constant. In other words, in all
cases the dependability was enhanced by having an increased
number of subtests even though the number of items was kept the
same.

However, a pattern emerged in examining the results across
tables which was consistent with the variance component findings
in the previous section. The influence of subtests was greatest
in Studies One (Total TOEFL) and Five (VRC2), and present to a
lesser degree in Study Four (VRC Test). In considering the Total
TOEFL results, it might at first glance appear that the effect
would be larger here than in the other studies because the
subtests themselves were longer, at 38 items each, than in any of
the other studies. However, this reasoning is contradicted by
the fact that an even larger effect for subtests was found in the
VRC2 results, which were based on four items in each subtest,
the smallest number of items per subtest reported in any of the
D-studies in this project.

It should be noted that Studies One and Five were quite
different from each other in structure. The relatively large
differences in dependability due to subtests in Study One were
due to differences between tests (i.e., the LC Test, SWE Test,
and VRC Test), while those observed for Study Five were due to
differences between reading passages (i.e., Passages 1 to 5).

What is the effect on score dependability of various cut-points? 

The results shown in Table 11 for D-studies One, Two, and
Four indicate that, for the existing (i.e., using a Mixed Effects
Model) Total TOEFL battery, LC Test, and VRC Test, the
dependability of decisions is not greatly different at various
cut points, and in any case, at the lowest point they are
acceptably dependable (at .957, .904, and .928, respectively).
The third D-study indicates that the dependability at the mean
(.861) is more markedly different from the dependabilities at
other cut points. Thus, though .861 is not problematic
dependability, it would be most responsible to apply additional
caution in interpreting decisions on the SWE Test that are at or
near the mean (approximately 50 on the standardized scores). [It
is also important to note that the .861 found here is probably an
underestimate of the existing state of affairs because it is
based on two subtests of 14 items while the original test was
based on two subtests of 14 and 24 items.] In short, decisions
based on the current TOEFL battery and individual tests of the
TOEFL can still be considered dependable even if those decisions
are made right at the mean score (of approximately 500 for the
battery or approximately 50 for the separate tests).

In the upper portion of Table 11, the Random Effects Model 
results turned out to be more conservative than the Mixed Effects 
results, showing both lower dependability in general and a more 
marked decline in the dependability at and near the mean. Recall
that this difference was expected due to the fact that these
results are generalizable to other test development projects.

CONCLUSION 

Test Dependability 

The effects on dependability of different numbers of
subtests and items (based on the random effects model) are shown
in Table 4 as variance components and in Tables 6 through 10 as
G-coefficients. One pattern that emerged is that
having multiple tests (i.e., the subtest facet) in the Total
TOEFL battery seems to have a strong beneficial effect on the
dependability of scores for the Total battery. In other words,
including component tests like the LC Test, SWE Test, and VRC
Test in the Total TOEFL battery has proven to be a sound policy
decision from the dependability perspective. In addition, based
on Table 6, further policy decisions can be made about the
relative merits of adding further items and/or component tests or
cutting down on their numbers.

Similarly, having multiple passages (the
subtest facet) in the VRC2 section seems to have a
beneficial effect on the dependability of scores for the VRC2
section. To some degree, having both the reading
comprehension and vocabulary subtests on the VRC Test also
appears to have a beneficial effect on the dependability of this
test, though the strong increases in dependability do not
appear to extend beyond two such subtests. In contrast, the
individual subtests within the LC Test and SWE Test, while they
do make some difference, appear to have less impact on the
dependability of the scores on these tests.

It is possible that in G-studies one, four and five, where 
the subtests facet did have an important impact, the subtests 
involved were significantly different from each other and thus 
contributed to the overall variance on the test above and beyond 
the contribution made by items. In contrast, in G-studies two 
and three, the subtests involved may be testing very much the 
same things. 

In terms of developing future versions of the TOEFL
(including TOEFL 2000) and other test development projects around
the world, recall that the results presented in Tables 4 and 6 to
10 were for random effects models and that they are therefore
generalizable to other versions of the test and other testing
projects. In short, the analyses in this project indicate that
subtests can make substantial contributions to the variance of
test scores and thus may affect dependability in important ways.
However, these results also make it clear that, in some cases,
subtests may have a negligible impact on dependability. Thus,
while inclusion of subtests or the expansion of the number of
subtests on a test may have a substantial beneficial effect on
the dependability of the scores on that test, this relationship
cannot be taken as a foregone conclusion.

Decision Dependability and Validity 

The results of this study are also related to the notions of
decision dependability and validity. At the beginning of this
paper, concern was expressed about the possibility that test
scores may not be equally reliable for making decisions at
different cut points in the score range. Since the dependability
of a test is often lowest at the mean and since many decisions
are made at or near the mean on TOEFL, this was a legitimate
concern. Portions of this project were therefore designed to
examine the degree to which these differences may affect the
dependability and therefore the validity of score users'
decisions. The lower portion of Table 11, which reports the
$\Phi(\lambda)$ coefficients when a mixed effects model is applied,
indicates that on the present TOEFL the lowest dependabilities
along the range are still very high. Thus, while it initially
seemed like a potential problem for score users, there appears to
be no need to worry about differential dependability at different
cut points on the existing test. In other words, regardless of
the cut point that current TOEFL score users may decide to be
valid for their own reasons, the effect on dependability of
various cut points is apparently not an issue of great concern.
In addition, ETS is currently implementing automated item
selection procedures to assemble TOEFL tests which will help to
ensure that each section will provide high information (or low
error variance) at the middle ability range. Naturally, any such
validity decisions should also be studied in the actual
context(s) in which the decisions are to be made.

In terms of future versions of the TOEFL and other testing
projects around the world, the upper portion of Table 11, which
reports the $\Phi(\lambda)$ coefficients when a random effects model is
applied, indicates that there may be more variation in
dependability estimates across the range of possible decision
points. Thus, while such differential dependability is
apparently not a problem on the current TOEFL, it is an issue
that should continue to concern developers of other tests and
future versions of the TOEFL.



Future Research 

In the course of conducting this project, a number of 
questions have occurred to us. They are presented here in the 
hope that they will be investigated in the future: 

1. Would similar results be obtained if these studies were 
replicated with other TOEFL data sets? 

2. Would similar results be obtained if such studies were 
replicated using other tests as the basis? 

3. What could generalizability theory tell us about the 
effects of raters on the scores of the Test of Written 
English? 

4. What could generalizability theory tell us about the 
effects of items and raters on the scores of the Test of 
Spoken English? 

5. What could be learned about the TOEFL battery and other 
tests by applying classical theory approaches to decision 
reliability/dependability at different cut points (for an 
overview of these approaches, see Feldt & Brennan, 1989, 
pp. 123-124)? 






REFERENCES 

Bolus, R.E., F.B. Hinofotis & K.M. Bailey. (1982). An introduction
to generalizability theory in second language research.
Language Learning, 32, 245-258.

Brennan, R.L. (1980). Applications of generalizability theory. In
R.A. Berk (Ed.) Criterion-referenced measurement: the state of
the art. Baltimore: Johns Hopkins University Press.

Brennan, R.L. (1983). Elements of generalizability theory. Iowa
City, IA: American College Testing Program.

Brennan, R.L. (1984). Estimating the dependability of the scores.
In R.A. Berk (Ed.) A guide to criterion-referenced test
construction. Baltimore: Johns Hopkins University Press.

Brennan, R.L. & M.T. Kane. (1977). Signal/noise ratios for
domain-referenced tests. Psychometrika, 42, 609-625.

Brown, J.D. (1984). A norm-referenced engineering reading test.
In A.K. Pugh & J.M. Ulijn (Eds.) Reading for professional
purposes: studies and practices in native and foreign
languages. London: Heinemann Educational Books.

Brown, J.D. (1990). Short-cut estimators of criterion-referenced
test consistency. Language Testing, 7, 77-97.

Brown, J.D. (1993). A comprehensive criterion-referenced testing
project. In D. Douglas & C. Chapelle (Eds.) A new decade of
language testing research (pp. 163-184). Alexandria, VA:
TESOL.

Brown, J.D. & K.M. Bailey. (1984). A categorical instrument for
scoring second language writing skills. Language Learning, 34,
4, 21-42.

Cronbach, L.J. & G.C. Gleser. (1964). The signal/noise ratio in
the comparison of reliability coefficients. Educational and
Psychological Measurement, 24, 467-480.

Cronbach, L.J., G.C. Gleser, H. Nanda, & N. Rajaratnam. (1972).
The dependability of behavioral measurements: Theory of
generalizability for scores and profiles. New York: John
Wiley.

Cronbach, L.J., N. Rajaratnam & G.C. Gleser. (1963). Theory of
generalizability: A liberalization of reliability theory.
British Journal of Statistical Psychology, 16, 137-163.

ETS. (1992). TOEFL test and score manual. Princeton, NJ:
Educational Testing Service.

ETS. (1993). Bulletin of information for TOEFL, TWE, and TSE.
Princeton, NJ: Educational Testing Service.

Feldt, L.S. & R.L. Brennan. (1989). Reliability. In R.L. Linn
(Ed.) Educational measurement (3rd ed.). New York: Macmillan.

Hudson, T. & B. Lynch. (1984). A criterion-referenced approach to
ESL achievement testing. Language Testing, 1, 171-201.

Kirk, R.E. (1968). Experimental design: Procedures for the
behavioral sciences. Belmont, CA: Brooks/Cole.

Shavelson, R.J. & N.M. Webb. (1981). Generalizability theory:
1973-1980. British Journal of Mathematical and Statistical
Psychology, 34, 133-166.

Stansfield, C.W. & D.M. Kenyon. (1992). Research of the
comparability of the oral proficiency interview and the
simulated oral proficiency interview. System, 20, 347-364.

Suen, H.K. (1990). Principles of test theories. Hillsdale, NJ:
Lawrence Erlbaum.



FIGURE 1: TOEFL STRUCTURE

FIGURES 2A-2E: G-STUDY STRUCTURES

TABLE 1: DESCRIPTIVE STATISTICS

STUDY                    ORIGINAL TEST               G-STUDY SAMPLING
  BATTERY/TEST/
  SUBTEST/PASSAGE      MEAN      STD   ITEMS      MEAN      STD   ITEMS

STUDY ONE
  TOTAL TOEFL       99.6788  27.2798     146   78.9712  21.3691     114
  LC TEST           31.1660  10.5508      50   24.0095   8.1238      38
  SWE TEST          27.5541   7.6467      38   27.5541   7.6467      38
  VRC TEST          40.9588  11.6509      58   27.4077   7.7959      38

STUDY TWO
  LC TEST           31.1660  10.5508      50   27.7185   9.5038      45
  LC1               12.6897   4.6491      20    9.2422   3.5805      15
  LC2                9.5017   3.3634      15    9.5017   3.3634      15
  LC3                8.9746   3.4902      15    8.9746   3.4902      15

STUDY THREE
  SWE TEST          27.5541   7.6467      38   20.4521   5.6912      28
  SWE1              10.5301   2.9676      14   10.5301   2.9676      14
  SWE2              17.0240   5.1066      24    9.9220   3.1298      14

STUDY FOUR
  VRC TEST          40.9588  11.6609      58   40.9588  11.6609      58
  VRC1              20.5785   6.2482      29   20.5785   6.2482      29
  VRC2              20.3804   6.0880      29   20.3804   6.0880      29

STUDY FIVE
  VRC2              20.3804   6.0880      29   13.8102   4.3375      20
  PASSAGE 1          5.5312   1.5248       ?    ?.1075   0.9972       ?
  PASSAGE 2          ?.9047   1.2311       ?    ?.0894   1.0408       4
  PASSAGE 3          ?.4653   2.0514       7    2.5816   1.2644       4
  PASSAGE 4          4.2770   1.7342       6    2.8295   1.2352       4
  PASSAGE 5          2.2022   1.3318       4    2.2022   1.3318       4

? = digits or entries missing from the source copy.
TABLE 2: CLASSICAL THEORY RELIABILITY STATISTICS

STUDY                    ORIGINAL TEST                   G-STUDY SAMPLING
  BATTERY/TEST/
  SUBTEST            (1)     (2)   Alpha  ITEMS      (1)     (2)   Alpha  ITEMS

STUDY ONE
  TOTAL TOEFL      .8927   .8916   .9667    146    .8896   .8881   .9584    114
  LC TEST              ?       ?   .9178     50        ?       ?       ?     38
  SWE TEST         .8752   .8652   .9016     38    .8752   .8652   .9016     38
  VRC TEST         .8808   .8806   .9326     58    .8617   .8616   .9033     38

STUDY TWO
  LC TEST              ?       ?   .9178     50        ?       ?   .9077     45
  LC1              .8209   .8209   .8349     20    .7706   .7669   .7845     15
  LC2              .7541   .7512   .7618     15    .7541   .7512   .7618     15
  LC3              .7457   .7437   .7677     15    .7457   .7437   .7677     15

STUDY THREE
  SWE TEST         .8752   .8652   .9016     38    .8520   .8513   .8686     28
  SWE1             .7565   .7478   .7726     14    .7565   .7478   .7726     14
  SWE2                 ?       ?       ?     24    .7466   .7370   .7723     14

STUDY FOUR
  VRC TEST         .8808   .8806   .9326     58    .8808   .8806   .9326     58
  VRC1             .8749   .8745   .8854     29    .8749   .8745   .8854     29
  VRC2                 ?       ?   .8769     29        ?       ?       ?     29

(1), (2) = column headings not legible in the source copy.
? = value not legible or missing in the source copy. One further
original-test coefficient for SWE2, .8263, is legible, but its
column cannot be determined.
TABLE 4: VARIANCE COMPONENTS FOR FIVE G-STUDIES

                        VARIANCE COMPONENTS FOR

            STUDY       STUDY       STUDY       STUDY       STUDY
            ONE:        TWO:        THREE:      FOUR:       FIVE:
            TOTAL       LC          SWE         VRC         VRC2
SOURCE      TOEFL       TEST        TEST        TEST        SECTION

RAW
COMPONENTS

p           .03140424   .04055400   .03517126   .03614178   .03699441
s           .00247421   .00000000*  .00028243   .00060287   .00710587
i:s         .01098074   .01178236   .00924644   .01189282   .00762986
ps          .00720131   .00136198   .00148391   .00327638   .01234403
pi:s        .16180465   .18380686   .15119135   .15613306   .15143681
Total       .21386515   .23750520   .19737539   .20804691   .21551098

*This value was a negative variance component, which was rounded
to zero after Brennan (1983: 47-48).
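
How a negative estimate such as the one footnoted above can arise
may be seen from the usual expected-mean-square equations for a
balanced p x (i:s) design. The sketch below is an illustration under
that assumption, not the authors' program. The function name is
hypothetical; the mean squares are those reported for Study Two (LC
Test) in Appendix A, with 20,000 persons, 3 subtests, and 15 items
per subtest.

```python
def variance_components(ms, n_p, n_s, n_i):
    """Estimate p x (i:s) variance components from mean squares.

    Negative estimates are set to zero, as in the Table 4 footnote.
    """
    raw = {
        "p": (ms["p"] - ms["ps"]) / (n_s * n_i),
        "s": (ms["s"] - ms["i:s"] - ms["ps"] + ms["pi:s"]) / (n_p * n_i),
        "i:s": (ms["i:s"] - ms["pi:s"]) / n_p,
        "ps": (ms["ps"] - ms["pi:s"]) / n_i,
        "pi:s": ms["pi:s"],
    }
    return {source: max(value, 0.0) for source, value in raw.items()}

# Mean squares for Study Two (LC Test), from Appendix A.
ms_lc = {"p": 2.0292, "s": 91.9020, "i:s": 235.8310,
         "ps": 0.2042, "pi:s": 0.1838}

components = variance_components(ms_lc, n_p=20000, n_s=3, n_i=15)
for source, value in components.items():
    print(f"{source:5s} {value:.6f}")
```

The subtests estimate is negative before truncation because the mean
square for subtests is smaller than the mean square for items nested
within subtests. Since the published mean squares are rounded to four
decimals, the other estimates agree with Table 4 only approximately.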
TABLE 5: SUMMARY OF D-STUDY RESULTS

                              D-STUDY RESULTS FOR

                          STUDY     STUDY     STUDY     STUDY     STUDY
                          ONE:      TWO:      THREE:    FOUR:     FIVE:
MODEL                     TOTAL     LC        SWE       VRC       VRC2
  Statistic               TOEFL     TEST      TEST      TEST      SECTION

  $n_s$                       3         3         2         2         5
  $n_i$                      38        15        14        29         4
  Total items               114        45        28        58        20
  $\sigma^2(p)$           .0314     .0406     .0352     .0361     .0370
  $\sigma^2(s)$           .0008     .0000     .0001     .0003     .0014
  $\sigma^2(i:s)$         .0001     .0003     .0003     .0002     .0004
  $\sigma^2(ps)$          .0024     .0005     .0007     .0016     .0025
  $\sigma^2(pi:s)$        .0014     .0041     .0054     .0027     .0076
  $\bar{X}_p$             .6927     .6160     .7304     .7062     .6905

RANDOM EFFECTS MODEL
  $\sigma^2(\tau)$        .0314     .0406     .0352     .0361     .0370
  $\sigma^2(\delta)$      .0038     .0045     .0061     .0043     .0100
  $\sigma^2(\Delta)$      .0047     .0048     .0066     .0048     .0118
  $E\sigma^2(X)$          .0352     .0451     .0413     .0405     .0470
  $\sigma^2(\bar{X})$     .0009     .0003     .0005     .0005     .0018
  $E\rho^2(\delta)$       .8916     .8993     .8513     .8930     .7865
  S/N                    8.2251    8.9305    5.7249    8.3458    3.6838

MIXED EFFECTS MODEL
  $\sigma^2(\tau)$        .0338     .0410     .0359     .0378     .0395
  $\sigma^2(\delta)$      .0014     .0041     .0054     .0027     .0076
  $\sigma^2(\Delta)$      .0015     .0043     .0057     .0029     .0080
  $E\sigma^2(X)$          .0352     .0451     .0413     .0405     .0470
  $\sigma^2(\bar{X})$     .0001     .0003     .0003     .0002     .0004
  $E\rho^2(\delta)$       .9597     .9094     .8693     .9335     .8390
  S/N                   23.8139   10.0375    6.6511   14.0376    5.2112
TABLE 11: SUMMARY OF PHI (LAMBDA) RESULTS 

PHI (LAMBDA) RESULTS FOR 

                STUDY      STUDY      STUDY      STUDY      STUDY 
                ONE:       TWO:       THREE:     FOUR:      FIVE: 
MODEL           TOTAL      LC         SWE        VRC        VRC2 
  Cut Point     TOEFL      TEST       TEST       TEST       SECTION 

RANDOM EFFECTS MODEL 
90%             .939       .962       .906       .938       .870 
80%             .899       .939       .857       .902       .799 
70%             .866       .908       .843       .881       .749 
60%             .892       .894       .887       .907       .786 
50%             .934       .918       .930       .942       .858 
40%             .961       .948       .956       .964       .910 
30%             .975       .967       .971       .976       .941 
20%             .983       .978       .980       .984       .959 
10%             .988       .985       .985       .988       .970 
Φ(X̄)            .865       .894       .840       .880       .748 
X̄ =             69%        62%        73%        71%        69% 

MIXED EFFECTS MODEL 
90%             .981       .965       .918       .963       .913 
80%             .968       .945       .876       .941       .865 
70%             .957       .917       .864       .928       .831 
60%             .965       .904       .902       .944       .856 
50%             .979       .926       .939       .965       .905 
40%             .987       .953       .962       .978       .939 
30%             .992       .970       .975       .986       .960 
20%             .995       .980       .982       .990       .972 
10%             .996       .985       .987       .993       .980 
Φ(X̄)            .957       .904       .861       .928       .831 
X̄ =             69%        62%        73%        71%        69% 
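
The phi (lambda) values in Table 11 can be approximated from the D-study statistics in Table 5. The sketch below assumes the usual estimator of the Brennan and Kane dependability index, in which the squared distance between the mean and the cut score is corrected by subtracting the error variance of the mean. Cut scores and means are expressed as proportions correct; the function name `phi_lambda` and its parameter names are hypothetical.

```python
def phi_lambda(cut, mean, var_tau, var_abs, var_mean):
    """Estimated dependability at a cut score (illustrative sketch).

    cut:      cut score (lambda) as a proportion correct
    mean:     observed grand mean as a proportion correct
    var_tau:  universe-score variance
    var_abs:  absolute error variance
    var_mean: error variance of the mean, subtracted to correct
              the squared distance (mean - cut)**2
    """
    signal = (mean - cut) ** 2 - var_mean + var_tau
    return 1.0 - var_abs / (signal + var_abs)


# Study One, random effects model, using the rounded Table 5 values.
for cut in (0.9, 0.7, 0.1):
    print(cut, round(phi_lambda(cut, 0.6927, 0.0314, 0.0047, 0.0009), 3))
```

Because the inputs here are the four-decimal values from Table 5, the results agree with Table 11 only to within about .002. Setting the cut score equal to the mean gives the lowest value in each column, which is the row labeled Φ(X̄).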



APPENDIX A: G-STUDY RESULTS [p x (i:s) DESIGNS] 

STUDY ONE - TOTAL TOEFL BATTERY 

SOURCE    SS            df         MS           EVC 
p         80306.73      19999      4.0155       0.03140424 
s         4200.90       2          2100.4500    0.00247421 
i:s       24395.20      111        219.7766     0.01098074 
ps        17417.30      39998      0.4355       0.00720131 
pi:s      359188.37     2219889    0.1618       0.16180465 



STUDY TWO - LC TEST 

SOURCE    SS            df         MS           EVC 
p         40581.31      19999      2.0292       0.04055400 
s         183.80        2          91.9020      0.00000000 
i:s       9904.90       42         235.8310     0.01178236 
ps        8169.05       39998      0.2042       0.00136198 
pi:s      154390.04     839958     0.1838       0.18380686 



STUDY THREE - SWE TEST 

SOURCE    SS            df         MS           EVC 
p         23134.07      19999      1.1568       0.03517126 
s         264.18        1          264.1800     0.00028243 
i:s       4812.08       26         185.0800     0.00924644 
ps        3439.15       19999      0.1720       0.00148391 
pi:s      78615.57      519974     0.1512       0.15119135 



STUDY FOUR - VRC TEST 

SOURCE    SS            df         MS           EVC 
p         46945.08      19999      2.3474       0.03614178 
s         587.77        1          587.7720     0.00060287 
i:s       13328.70      56         238.0125     0.01189282 
ps        5022.71       19999      0.2511       0.00327638 
pi:s      174860.28     1119944    0.1561       0.15613306 



STUDY FIVE - VRC2 SECTION 

SOURCE    SS            df         MS           EVC 
p         18813.08      19999      0.9407       0.03699441 
s         2885.07       4          721.2675     0.00710587 
i:s       2291.23       15         152.7487     0.00762986 
ps        16064.23      79996      0.2008       0.01234403 
pi:s      45428.77      299985     0.1514       0.15143681 
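
The EVC column in each table above can be recovered, to within rounding, from the MS column. The sketch below assumes the standard expected-mean-square solutions for a balanced p x (i:s) design with all effects random, and sets negative estimates to zero as described in the footnote citing Brennan (1983). The function name `evc_from_mean_squares` is hypothetical, and n_p = 20000 is inferred from the 19999 degrees of freedom for persons.

```python
def evc_from_mean_squares(ms, n_p, n_s, n_i):
    """Estimated variance components for a balanced p x (i:s) design.

    ms:  mean squares keyed by source ("p", "s", "i:s", "ps", "pi:s")
    n_p: number of persons; n_s: number of subtests;
    n_i: number of items per subtest.
    """
    estimates = {
        "pi:s": ms["pi:s"],
        "ps": (ms["ps"] - ms["pi:s"]) / n_i,
        "i:s": (ms["i:s"] - ms["pi:s"]) / n_p,
        "s": (ms["s"] - ms["i:s"] - ms["ps"] + ms["pi:s"]) / (n_p * n_i),
        "p": (ms["p"] - ms["ps"]) / (n_s * n_i),
    }
    # Negative estimates are set to zero, as in the table footnote.
    return {source: max(value, 0.0) for source, value in estimates.items()}


# Mean squares for Study One and Study Two as listed above.
ms_one = {"p": 4.0155, "s": 2100.4500, "i:s": 219.7766,
          "ps": 0.4355, "pi:s": 0.1618}
ms_two = {"p": 2.0292, "s": 91.9020, "i:s": 235.8310,
          "ps": 0.2042, "pi:s": 0.1838}
evc_one = evc_from_mean_squares(ms_one, n_p=20000, n_s=3, n_i=38)
evc_two = evc_from_mean_squares(ms_two, n_p=20000, n_s=3, n_i=15)
print(evc_one)
print(evc_two)
```

For Study Two the subtest mean square is smaller than the item-within-subtest mean square, so the raw estimate for s is negative and is reported as zero, which matches the 0.00000000 entry in that table.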